Aino Peura 28.11.2021 These are the week 4 exercises.

Libraries used:

#install.packages("MASS")
#install.packages("corrplot")
#install.packages("Hmisc")
#install.packages("GGally")
library(MASS)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.6.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.3
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 3.6.3
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(GGally)
## Warning: package 'GGally' was built under R version 3.6.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Let's load and explore the Boston dataset:

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston data frame describes housing values in suburbs of Boston. It has 506 observations and 14 variables. The following info is from the R documentation of the MASS package:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centres.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: \(1000(Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.

The first task is to show a graphical overview of the data, summaries of the variables, and the relationships between the variables. This will be done in three parts. First, graphical overviews of the variables are shown with the following:

#With the following code, density estimates are computed (density()) and plotted to show the distribution of each variable
par(mfrow=c(2,4))
for(i in 1:14){
  nimi<- colnames(Boston)[i]
  data<- density(Boston[,i])
  plot(data, main = nimi)
}

From a quick, admittedly rough, visual interpretation we can see that:
- variables with a peak at small values:
  - crim: per capita crime rate by town.
  - zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  - chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  - dis: weighted mean of distances to five Boston employment centres.
- variables with a peak at high values:
  - age: proportion of owner-occupied units built prior to 1940.
  - ptratio: pupil-teacher ratio by town.
  - black: \(1000(Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
- variables that are approximately normally distributed:
  - lstat: lower status of the population (percent).
  - medv: median value of owner-occupied homes in $1000s.
  - nox: nitrogen oxides concentration (parts per 10 million).
  - rm: average number of rooms per dwelling.
- variables that have two peaks:
  - indus: proportion of non-retail business acres per town.
  - rad: index of accessibility to radial highways.
  - tax: full-value property-tax rate per $10,000.

Then, let's see the summaries of our data:

for(i in 1:14){
  nimi<- colnames(Boston)[i]
  data<- summary(Boston[,i])
  print(nimi)
  print(data)
}
## [1] "crim"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620 
## [1] "zn"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   11.36   12.50  100.00 
## [1] "indus"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.46    5.19    9.69   11.14   18.10   27.74 
## [1] "chas"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06917 0.00000 1.00000 
## [1] "nox"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3850  0.4490  0.5380  0.5547  0.6240  0.8710 
## [1] "rm"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.561   5.886   6.208   6.285   6.623   8.780 
## [1] "age"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.90   45.02   77.50   68.57   94.08  100.00 
## [1] "dis"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.130   2.100   3.207   3.795   5.188  12.127 
## [1] "rad"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   9.549  24.000  24.000 
## [1] "tax"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0 
## [1] "ptratio"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00 
## [1] "black"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.32  375.38  391.44  356.67  396.23  396.90 
## [1] "lstat"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.73    6.95   11.36   12.65   16.95   37.97 
## [1] "medv"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

And finally, the correlation plot. Hierarchical clustering is used to order the variables, via order = "hclust":

par(mfrow=c(1,2))
#Let's compute the correlation matrix
correlations<-cor(Boston)
#Let's compute the significances:
testRes<- cor.mtest(Boston)
#Let's plot it; cells left blank are not significant
corrplot(correlations, p.mat =testRes$p, insig = "blank",method = "number", order = "hclust", addrect = 2, number.cex = 0.75)

From the correlation plot we can see that (including only significant correlations):
- variables ptratio, lstat, age, indus, nox, crim, rad and tax seem to correlate positively with each other
- variables dis, medv and rm seem to correlate positively with chas, black and zn (and obviously with themselves)
- chas only correlates positively with rm, medv and dis, and negatively with ptratio
- the strongest positive correlation is between rad and tax

Dataset scaling

So, the next task is to scale the Boston dataset.

BostonScaled<- as.data.frame(scale(Boston))

#Let's print summaries of the scaled Boston dataset and, under each one, the unscaled data so we can see what changed: 
for(i in 1:14){
  nimi<- colnames(BostonScaled)[i]
  data1<- summary(BostonScaled[,i])
  data2<-summary(Boston[,i])
  print(nimi)
  print(data1)
  print(data2)
}
## [1] "crim"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.08204  0.25651  3.61352  3.67708 88.97620 
## [1] "zn"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.48724 -0.48724 -0.48724  0.00000  0.04872  3.80047 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   11.36   12.50  100.00 
## [1] "indus"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5563 -0.8668 -0.2109  0.0000  1.0150  2.4202 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.46    5.19    9.69   11.14   18.10   27.74 
## [1] "chas"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2723 -0.2723 -0.2723  0.0000 -0.2723  3.6648 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06917 0.00000 1.00000 
## [1] "nox"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4644 -0.9121 -0.1441  0.0000  0.5981  2.7296 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3850  0.4490  0.5380  0.5547  0.6240  0.8710 
## [1] "rm"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.8764 -0.5681 -0.1084  0.0000  0.4823  3.5515 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.561   5.886   6.208   6.285   6.623   8.780 
## [1] "age"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.3331 -0.8366  0.3171  0.0000  0.9059  1.1164 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.90   45.02   77.50   68.57   94.08  100.00 
## [1] "dis"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2658 -0.8049 -0.2790  0.0000  0.6617  3.9566 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.130   2.100   3.207   3.795   5.188  12.127 
## [1] "rad"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9819 -0.6373 -0.5225  0.0000  1.6596  1.6596 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   9.549  24.000  24.000 
## [1] "tax"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.3127 -0.7668 -0.4642  0.0000  1.5294  1.7964 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   279.0   330.0   408.2   666.0   711.0 
## [1] "ptratio"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.7047 -0.4876  0.2746  0.0000  0.8058  1.6372 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.60   17.40   19.05   18.46   20.20   22.00 
## [1] "black"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.9033  0.2049  0.3808  0.0000  0.4332  0.4406 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.32  375.38  391.44  356.67  396.23  396.90 
## [1] "lstat"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5296 -0.7986 -0.1811  0.0000  0.6024  3.5453 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.73    6.95   11.36   12.65   16.95   37.97 
## [1] "medv"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.9063 -0.5989 -0.1449  0.0000  0.2683  2.9865 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

Okay, from the values above we can see that the scale function scaled every variable to have mean 0, and every value now represents how many standard deviations it is from the mean. These scaled values are called z-scores.
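As a quick sanity check (a sketch, not part of the exercise), scale() should agree with the hand-computed z-score (x - mean(x)) / sd(x):

```r
library(MASS)

# z-score by hand vs. scale() on one variable
x <- Boston$crim
z_manual <- (x - mean(x)) / sd(x)
z_scaled <- as.numeric(scale(x))

all.equal(z_manual, z_scaled)  # TRUE
round(mean(z_scaled), 10)      # 0: the mean is centered to zero
round(sd(z_scaled), 10)        # 1: one unit = one standard deviation
```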

Then, the next task is to divide the crime rate variable into quantiles and create a categorical variable called crime:

#Let's compute the quantiles
bins<- quantile(BostonScaled$crim)

#Let's create the new variable with this nested ifelse.
#We start with the smallest bin; each no-branch tests for the next category
BostonScaled$crime<- ifelse(BostonScaled$crim<bins[2], paste("[", round(bins[1], digits = 3), ",", round(bins[2], digits = 3), "]", sep = ""), 
                            ifelse(BostonScaled$crim<bins[3], paste("(", round(bins[2], digits = 3), ",", round(bins[3], digits = 3), "]", sep = ""),
                                   ifelse(BostonScaled$crim<bins[4], paste("(", round(bins[3], digits = 3), ",", round(bins[4], digits = 3), "]", sep = ""),                                               paste("(",round(bins[4], digits = 3), ",", round(bins[5], digits = 3), "]", sep = "") )))

table(BostonScaled$crime)
## 
##   (-0.39,0.007]  (-0.411,-0.39]   (0.007,9.924] [-0.419,-0.411] 
##             126             126             127             127
BostonScaled<- BostonScaled[,2:15]

#In DataCamp these were renamed into descriptive classes (low, med_low, med_high, high), so, just for clarification, we do the same kind of renaming

unique(BostonScaled$crime)
## [1] "[-0.419,-0.411]" "(-0.411,-0.39]"  "(-0.39,0.007]"   "(0.007,9.924]"
BostonScaled$crime<- ifelse(BostonScaled$crime=="[-0.419,-0.411]", "low", BostonScaled$crime)
BostonScaled$crime<- ifelse(BostonScaled$crime=="(-0.411,-0.39]", "low_med", BostonScaled$crime)
BostonScaled$crime<- ifelse(BostonScaled$crime=="(-0.39,0.007]", "high_med", BostonScaled$crime)
BostonScaled$crime<- ifelse(BostonScaled$crime=="(0.007,9.924]", "high", BostonScaled$crime)
table(BostonScaled$crime)
## 
##     high high_med      low  low_med 
##      127      126      127      126
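For reference, the nested ifelse plus the renaming above can be done in one step with cut(); this is a sketch of the same binning, using the same quartile breaks and the class names used in this report:

```r
library(MASS)

BostonScaled <- as.data.frame(scale(Boston))
bins <- quantile(BostonScaled$crim)

# cut() bins the values and labels them in one step;
# include.lowest = TRUE keeps the minimum value in the first interval
BostonScaled$crime <- cut(BostonScaled$crim, breaks = bins,
                          labels = c("low", "low_med", "high_med", "high"),
                          include.lowest = TRUE)
table(BostonScaled$crime)
```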

All good! Now that we have created the variable crime, let's divide the data into train and test sets (80% into train, 20% into test):

#Let's create our train-data subset
trainData<- BostonScaled[(sample(nrow(BostonScaled), size= 0.8*nrow(BostonScaled))),]
#And the leftovers are the test data
testData<- subset(BostonScaled, is.element(rownames(BostonScaled), rownames(trainData))==F)
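Note that sample() above is unseeded, so the split (and every result downstream) changes each time the document is knitted. A reproducible sketch of the same 80/20 split mechanics on the scaled data, with an arbitrary seed:

```r
library(MASS)

BostonScaled <- as.data.frame(scale(Boston))
set.seed(2021)  # arbitrary seed, only so the split is reproducible

n <- nrow(BostonScaled)
train_idx <- sample(n, size = floor(0.8 * n))
trainData <- BostonScaled[train_idx, ]
testData  <- BostonScaled[-train_idx, ]  # negative indexing: everything not in train

c(train = nrow(trainData), test = nrow(testData))  # 404 and 102
```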

Linear discriminant analysis

So, the next task is to fit a linear discriminant analysis to the train dataset.

#LDA_analysis
ldadata<- lda(crime ~ ., data = trainData)

#Let's make crime numeric, maintaining the original ordering of the classes
trainData$crime<- ifelse(trainData$crime=="low", 1, ifelse(trainData$crime=="low_med", 2, ifelse(trainData$crime=="high_med", 3, 4)))

#Then the arrows function from DataCamp
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

#Let's define the classes variable
classes <- as.numeric(trainData$crime)

#Let's create the biplot with the arrows; classes defines both the colors and the plotting symbols
plot(ldadata, dimen = 2, col = classes, pch = classes)
lda.arrows(ldadata, myscale = 1)

Prediction

First, let's save the correct classes of the crime variable from the test set:

correct <- testData$crime

testData<- select(testData, -crime)

Then, let's predict the classes of the test data:

ldaprediction<- predict(ldadata, newdata = testData)
table(correct= correct, predicted = ldaprediction$class)
##           predicted
## correct    high high_med low low_med
##   high       23        0   0       0
##   high_med    1       17   0      10
##   low         0        1  15      13
##   low_med     0        2   7      13

So, the model struggles to predict the classes low and low_med correctly: many low values are predicted to be low_med. For the other classes, the predictions are quite accurate (percentages read from the confusion matrix above):
- 100% of high values (23/23) are predicted correctly
- 61% of high_med values (17/28) are in the correct category
- 59% of low_med values (13/22) are in the correct category
- BUT only 52% of low values (15/29) are in the low category
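The per-class accuracies can be read straight off the confusion matrix: the diagonal divided by the row sums. A sketch using the table printed above (hard-coded here, since the unseeded split changes between runs):

```r
# Confusion matrix from the output above: rows = correct, columns = predicted
tab <- matrix(c(23,  0,  0,  0,
                 1, 17,  0, 10,
                 0,  1, 15, 13,
                 0,  2,  7, 13),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("high", "high_med", "low", "low_med"),
                              predicted = c("high", "high_med", "low", "low_med")))

# Share of each correct class that was predicted correctly
round(diag(tab) / rowSums(tab), 2)
#>     high high_med      low  low_med
#>     1.00     0.61     0.52     0.59
```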

Distances

So, let's make BostonScaled2, which is the original Boston dataset scaled:

BostonScaled2<-scale(Boston) 

Then, let's compute the Euclidean and Manhattan distances within the dataset:

eu<- dist(BostonScaled2, method = "euclidean")#This is the straight-line (hypotenuse) distance, so the shortest possible distance
man<- dist(BostonScaled2, method = "manhattan")#This is the sum of the coordinate-wise absolute differences (like navigating around the square blocks of Manhattan)
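To make the two metrics concrete, this sketch checks dist() on the first two rows against the textbook formulas:

```r
library(MASS)

BostonScaled2 <- scale(Boston)
a <- BostonScaled2[1, ]
b <- BostonScaled2[2, ]

# Euclidean: square root of the sum of squared coordinate differences
eu_manual  <- sqrt(sum((a - b)^2))
# Manhattan: sum of the absolute coordinate differences
man_manual <- sum(abs(a - b))

all.equal(eu_manual,  as.numeric(dist(BostonScaled2[1:2, ], method = "euclidean")))  # TRUE
all.equal(man_manual, as.numeric(dist(BostonScaled2[1:2, ], method = "manhattan")))  # TRUE
```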

Now, let's do k-means clustering:

#First, let's set the seed
set.seed(56)

#Then, let's compute the WCSS (within-cluster sum of squares) to find the optimal number of clusters (where the WCSS drops drastically)
#We cap the number of clusters at 10, for the sake of time and the computer
#We run kmeans from k = 1 to k = 10 and save the tot.withinss of each run into WCSS
WCSS<- sapply(1:10, function(k){kmeans(BostonScaled2, k)$tot.withinss})

qplot(x=1:10, y=WCSS, geom="line")

#From here we can see that the greatest drop in the WCSS happens around 2, so the optimal number of clusters is 2

BostonScaled2<- as.data.frame(BostonScaled2)
km<-kmeans(BostonScaled2, centers = 2)
BostonScaled2$cluster<- km$cluster
BostonScaled2$cluster<- as.factor(BostonScaled2$cluster)


ggpairs(BostonScaled2, upper = "blank", diag = "blank", mapping=ggplot2::aes(colour = cluster))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation of the data: even though the panels are very tiny, we can see the distribution of the colors: the two clusters separate the variable crim beautifully. zn is also nicely divided, and the same is true for indus, dis and nox. However, I don't think the other variables are separated by these two clusters.

Bonus 1

BostonScaled3<- scale(Boston)
km<-kmeans(BostonScaled3, centers = 5) #cluster the freshly scaled data, not BostonScaled2 (which already contains a cluster column)
BostonScaled3<- as.data.frame(BostonScaled3)
BostonScaled3$cluster<- km$cluster

#Then, let's perform LDA with the clusters as target classes
LDA<- ldadata<- lda(cluster ~ ., data = BostonScaled3)

#Let's define the lda.arrows function again. 
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

#Let's define the classes variable
classes <- as.numeric(BostonScaled3$cluster)

#Let's create the biplot with the arrows; classes defines both the colors and the plotting symbols
plot(LDA, dimen = 2, col = classes, pch = classes)
lda.arrows(LDA, myscale = 3)

So, let's interpret the results. nox, indus and rad are great separators of the clusters. age, crim and also zn separate clusters 1 and 2 from clusters 3, 4 and 5. I would say that age is the strongest linear separator, because it has the longest vector.

Super-Bonus

First, let's run the code:

model_predictors <- select(trainData, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
#Note: ldadata was overwritten in Bonus 1, so its scaling matrix includes crim as a predictor; we drop the first row to match our 13 predictors
ldadata$scaling<- ldadata$scaling[2:14,]
dim(ldadata$scaling)
## [1] 13  4
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% ldadata$scaling
matrix_product <- as.data.frame(matrix_product)
#Let's color the points by crime class
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = trainData$crime)
km<- kmeans(trainData, centers = 5)
#Then, let's color the points by cluster
trainData$clusters<- km$cluster
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = trainData$clusters)

Alright, let's do some visual interpretation. Low crime corresponds to cluster 1 and high crime to cluster 4. Otherwise, the points' clusters do not match the crime classes.